Pruning nearest neighbor cluster trees
Abstract
Nearest neighbor (k-NN) graphs are widely used in machine learning and data mining applications, and our aim is to better understand what they reveal about the cluster structure of the unknown underlying distribution of points. Moreover, is it possible to identify spurious structures that might arise due to sampling variability? Our first contribution is a statistical analysis that reveals how certain subgraphs of a k-NN graph form a consistent estimator of the cluster tree of the underlying distribution of points. Our second and perhaps most important contribution is the following finite sample guarantee. We carefully work out the tradeoff between aggressive and conservative pruning and are able to guarantee the removal of all spurious cluster structures at all levels of the tree while at the same time guaranteeing the recovery of salient clusters. This is the first such finite sample result in the context of clustering.
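The idea that subgraphs of a k-NN graph can track cluster structure can be illustrated with a minimal sketch. It is not the construction analyzed in the paper: it assumes the distance to the k-th nearest neighbor as an inverse proxy for density, keeps only points whose k-NN radius is at most a threshold `r`, and reports the connected components of the k-NN graph restricted to those points. The names `knn_lists`, `level_components`, and the parameter `r` are hypothetical, and the pruning step described in the abstract is not implemented here.

```python
import math
from typing import List, Sequence, Set, Tuple

Point = Tuple[float, ...]


def knn_lists(points: Sequence[Point], k: int) -> Tuple[List[List[int]], List[float]]:
    """For each point, return its k nearest neighbours and its k-NN radius."""
    neighbours: List[List[int]] = []
    radii: List[float] = []
    for i, p in enumerate(points):
        # Brute-force distances to every other point, sorted ascending.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbours.append([j for _, j in dists[:k]])
        radii.append(dists[k - 1][0])  # distance to the k-th nearest neighbour
    return neighbours, radii


def level_components(points: Sequence[Point], k: int, r: float) -> List[Set[int]]:
    """Connected components of the k-NN graph restricted to points
    whose k-NN radius is at most r (an assumed stand-in for a density level)."""
    neighbours, radii = knn_lists(points, k)
    kept = {i for i, radius in enumerate(radii) if radius <= r}

    # Undirected adjacency: keep a k-NN edge only if both endpoints survive.
    adj = {i: set() for i in kept}
    for i in kept:
        for j in neighbours[i]:
            if j in kept:
                adj[i].add(j)
                adj[j].add(i)

    # Depth-first search to collect components, ordered by smallest index.
    components: List[Set[int]] = []
    seen: Set[int] = set()
    for start in sorted(kept):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:
            u = stack.pop()
            comp.add(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        components.append(comp)
    return components


if __name__ == "__main__":
    pts = [(0,), (1,), (2,), (50,), (51,), (52,)]
    print(level_components(pts, k=2, r=5.0))
```

Sweeping `r` from small to large values gives nested families of components, which is the sense in which such subgraphs resemble a tree of clusters. Deciding which of those components are spurious is the pruning question the abstract addresses, and this sketch leaves it open.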
Similar resources
On the Generalization of Nearest Neighbor Queries
Nearest neighbor queries on R-trees use a number of pruning techniques to improve the search. We examine three common 1-nearest neighbor pruning strategies and generalize them to k-nearest neighbors. This generalization clears up a number of prior misconceptions. Specifically, we show that the generalization of one pruning technique, referred to as strategy 2, is non-trivial and requires the in...
Search Space Reduction in R-trees
Pruning plays an integral role in reducing the search space of nearest neighbor queries on data structures like the R-tree. We show that a popular pruning strategy for nearest queries can reduce the search space exponentially in R-trees. In light of this, we provide a generalization of the strategy to k-nearest neighbors. We call the extension Promise-Pruning and, for any k, construct a class o...
Optimizing Search Strategies in k-d Trees
While k-d trees have been widely studied and used, their theoretical advantages are often not realized due to ineffective search strategies and generally poor performance in high dimensional spaces. In this paper we outline an effective search algorithm for k-d trees that combines an optimal depth-first branch and bound (DFBB) strategy with a unique method for path ordering and pruning. Our ini...
Search Space Reductions for Nearest-Neighbor Queries
The vast number of applications featuring multimedia and geometric data has made the R-tree a ubiquitous data structure in databases. A popular and fundamental operation on R-trees is nearest neighbor search. While nearest neighbor on R-trees has received considerable experimental attention, it has received somewhat less theoretical consideration. We study pruning heuristics for nearest neighbo...
Tree-Independent Dual-Tree Algorithms
Dual-tree algorithms are a widely used class of branch-and-bound algorithms. Unfortunately, developing dual-tree algorithms for use with different trees and problems is often complex and burdensome. We introduce a four-part logical split: the tree, the traversal, the point-to-point base case, and the pruning rule. We provide a meta-algorithm which allows development of dual-tree algorithms in a...
Publication date: 2011